Home Depot Product Search Relevance

The challenge is to predict a relevance score for each provided combination of search term and product. To create the ground-truth labels, Home Depot crowdsourced the search/product pairs to multiple human raters, who scored each pair on a scale from 1 (not relevant) to 3 (highly relevant).

GraphLab Create

This notebook uses the GraphLab Create machine learning module for IPython. You need a personal license to run this code.
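
For completeness, a minimal sketch of registering that license key; the gl.product_key.set_product_key call is an assumption based on GraphLab Create's setup instructions of the time, so verify it against your installed version:

import graphlab as gl
# assumption: GraphLab Create 1.8.x exposed this one-time key registration call
gl.product_key.set_product_key('YOUR-PRODUCT-KEY')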


In [1]:
import graphlab as gl

Load data from CSV files


In [2]:
train = gl.SFrame.read_csv("../data/train.csv")


[INFO] GraphLab Create v1.8.3 started. Logging: /tmp/graphlab_server_1456701323.log
Finished parsing file /Users/tjaskula/Documents/GitHub/Kaggle.HomeDepot/data/train.csv
Parsing completed. Parsed 74067 lines in 0.179878 secs.
------------------------------------------------------
Inferred types from first line of file as 
column_type_hints=[int,int,str,str,float]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------

In [3]:
test = gl.SFrame.read_csv("../data/test.csv")


Finished parsing file /Users/tjaskula/Documents/GitHub/Kaggle.HomeDepot/data/test.csv
Parsing completed. Parsed 100 lines in 0.213001 secs.
Finished parsing file /Users/tjaskula/Documents/GitHub/Kaggle.HomeDepot/data/test.csv
Parsing completed. Parsed 166693 lines in 0.330936 secs.
------------------------------------------------------
Inferred types from first line of file as 
column_type_hints=[int,int,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------

In [4]:
desc = gl.SFrame.read_csv("../data/product_descriptions.csv")


Finished parsing file /Users/tjaskula/Documents/GitHub/Kaggle.HomeDepot/data/product_descriptions.csv
Parsing completed. Parsed 100 lines in 0.531025 secs.
Read 61134 lines. Lines per second: 58160.7
Finished parsing file /Users/tjaskula/Documents/GitHub/Kaggle.HomeDepot/data/product_descriptions.csv
Parsing completed. Parsed 124428 lines in 1.65722 secs.
------------------------------------------------------
Inferred types from first line of file as 
column_type_hints=[int,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------

In [5]:
attr = gl.SFrame.read_csv("../data/attributes.csv")


Finished parsing file /Users/tjaskula/Documents/GitHub/Kaggle.HomeDepot/data/attributes.csv
Parsing completed. Parsed 100 lines in 0.739144 secs.
Finished parsing file /Users/tjaskula/Documents/GitHub/Kaggle.HomeDepot/data/attributes.csv
Parsing completed. Parsed 2044803 lines in 1.69493 secs.
------------------------------------------------------
Inferred types from first line of file as 
column_type_hints=[int,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------

Data merging and feature engineering


In [6]:
# merge train with description
train = train.join(desc, on = 'product_uid', how = 'left')

In [7]:
# merge test with description
test = test.join(desc, on = 'product_uid', how = 'left')

In [8]:
# attributes whose value is "No" mean the product lacks that feature, so drop them
print len(attr)
attr = attr[attr['value'] != "No"]
print len(attr)


2044803
1952634

In [9]:
# if an attribute's value is "Yes", copy the attribute name into the value so the attribute text can be searched
attr['value'] = attr.apply(lambda x: x['name'] if x['value'] == "Yes" else x['value'])
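
As a minimal illustration of what this transform does on a single row (plain Python; the attribute name here is hypothetical):

# illustrative only: the "Yes" normalisation applied to one hypothetical row
row = {'name': 'ENERGY STAR Certified', 'value': 'Yes'}
value = row['name'] if row['value'] == 'Yes' else row['value']
print value  # -> 'ENERGY STAR Certified', making the attribute text searchable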

Let's select brands


In [10]:
brands = attr[attr['name'] == "MFG Brand Name"]

In [11]:
brands.head()


Out[11]:
product_uid name value
100001 MFG Brand Name Simpson Strong-Tie
100002 MFG Brand Name BEHR Premium Textured DeckOver ...
100003 MFG Brand Name STERLING
100004 MFG Brand Name Grape Solar
100005 MFG Brand Name Delta
100006 MFG Brand Name Whirlpool
100007 MFG Brand Name Lithonia Lighting
100008 MFG Brand Name Teks
100009 MFG Brand Name House of Fara
100010 MFG Brand Name Valley View Industries
[10 rows x 3 columns]

Bullets too


In [12]:
bullets = attr[attr['name'].contains("Bullet")]

In [13]:
# converting bullets to columns: unstack collapses each product's (name, value)
# pairs into a single dict, and unpack expands that dict into one column per bullet
bullets = bullets.unstack(column = ['name', 'value'], new_column_name = "bullets")
bullets = bullets.unpack("bullets")
bullets = bullets.sort("product_uid")
print len(bullets)


86263
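
To see what unstack and unpack do here, a toy sketch with made-up bullet values (same GraphLab calls as above):

# toy example: two bullet attributes for a single product
toy = gl.SFrame({'product_uid': [100001, 100001],
                 'name': ['Bullet01', 'Bullet02'],
                 'value': ['stainless steel', 'rust resistant']})
stacked = toy.unstack(column = ['name', 'value'], new_column_name = "bullets")
# -> one row: product_uid = 100001, bullets = {'Bullet01': ..., 'Bullet02': ...}
print stacked.unpack("bullets")
# -> columns bullets.Bullet01 and bullets.Bullet02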

In [14]:
# merge train with brands and bullets
train = train.join(brands, on = 'product_uid', how = 'left')
train = train.join(bullets, on = 'product_uid', how = 'left')

In [15]:
# merge test with brands and bullets
test = test.join(brands, on = 'product_uid', how = 'left')
test = test.join(bullets, on = 'product_uid', how = 'left')

TF-IDF with linear regression


In [16]:
def calculateTfIdf(cols, data, searchColTfIdfName):
    # for each text column: bag-of-words counts, TF-IDF weights, and the cosine
    # distance between the column's TF-IDF vector and the search term's
    for colName in cols:
        newColNameWordCount = colName + "_word_count"
        newColNameTfIdf = colName + "_tfidf"
        newColDistance = colName + "_distance"

        data[newColNameWordCount] = gl.text_analytics.count_words(data[colName])
        data[newColNameTfIdf] = gl.text_analytics.tf_idf(data[newColNameWordCount])

        # note: this guard compares the raw column name against the TF-IDF column
        # name ('search_term_tfidf'), so it is always true and a search_term_distance
        # column (the search vector's distance to itself, identically 0) is created too
        if searchColTfIdfName != colName:
            data[newColDistance] = data.apply(
                lambda x: 0 if x[newColNameTfIdf] is None
                          else gl.distances.cosine(x[searchColTfIdfName], x[newColNameTfIdf]))

    return data
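
A quick sanity check of the distance used above, on hand-built toy vectors (this assumes gl.distances.cosine accepts dict-typed sparse vectors, as in the apply above):

a = {'deck': 0.9, 'paint': 0.4}
b = {'deck': 0.9, 'stain': 0.7}
print gl.distances.cosine(a, a)  # identical vectors -> 0.0
print gl.distances.cosine(a, b)  # partial overlap -> strictly between 0 and 1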

In [17]:
# columns = ['search_term', 'product_title', 'product_description', 'value', 'bullets.Bullet01',
#          'bullets.Bullet02', 'bullets.Bullet03', 'bullets.Bullet04', 'bullets.Bullet05', 'bullets.Bullet06'
#          , 'bullets.Bullet07', 'bullets.Bullet08', 'bullets.Bullet09', 'bullets.Bullet10', 'bullets.Bullet11'
#          , 'bullets.Bullet12', 'bullets.Bullet13', 'bullets.Bullet14', 'bullets.Bullet15', 'bullets.Bullet16'
#          , 'bullets.Bullet17', 'bullets.Bullet18', 'bullets.Bullet19', 'bullets.Bullet20', 'bullets.Bullet21'
#          , 'bullets.Bullet22']

columns = ['search_term', 'product_title', 'product_description', 'value']

train = calculateTfIdf(columns, train, 'search_term_tfidf')

In [18]:
test = calculateTfIdf(columns, test, 'search_term_tfidf')

In [19]:
featuresDistance = [s for s in train.column_names() if "distance" in s]
print featuresDistance


['search_term_distance', 'product_title_distance', 'product_description_distance', 'value_distance']

In [20]:
#train = train.dropna('value_distance')

In [21]:
model1 = gl.linear_regression.create(train, target = 'relevance', features = featuresDistance)


Linear regression:
--------------------------------------------------------
Number of examples          : 70282
Number of features          : 4
Number of unpacked features : 4
Number of coefficients    : 5
Starting Newton Method
--------------------------------------------------------
+-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
| Iteration | Passes   | Elapsed Time | Training-max_error | Validation-max_error | Training-rmse | Validation-rmse |
+-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
| 1         | 2        | 1.033884     | 1.933607           | 1.718643             | 0.507793      | 0.501934        |
+-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
SUCCESS: Optimal solution found.
PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.
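
As the log suggests, the automatic 5 percent validation split can be disabled to train on all 74067 rows; a sketch (model_full is a hypothetical name):

# optional: train on the full training set, without the validation split
model_full = gl.linear_regression.create(train, target = 'relevance',
                                         features = featuresDistance,
                                         validation_set = None)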



In [22]:
# let's take a look at the learned weights
model1.get("coefficients")


Out[22]:
name                          index  value             stderr
(intercept)                   None   3.34869888667     0.0151876483122
search_term_distance          None   -0.0174618514756  1.47829769014e+13
product_title_distance        None   -0.595348024821   0.0128923155948
product_description_distance  None   -0.483697113003   0.0199678645082
value_distance                None   -0.127583244653   0.00464255002758
[5 rows x 4 columns]
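
Note the enormous standard error on search_term_distance: as flagged in calculateTfIdf, that column is identically zero, so it carries no information. A sketch of retraining without it (nonZeroFeatures and model2 are hypothetical names):

# drop the degenerate all-zero feature and retrain
nonZeroFeatures = [f for f in featuresDistance if f != 'search_term_distance']
model2 = gl.linear_regression.create(train, target = 'relevance',
                                     features = nonZeroFeatures)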


In [23]:
# disabled: the test set has no 'relevance' column, so RSS cannot be computed on it
'''
predictions_test = model1.predict(test)
test_errors = predictions_test - test['relevance']
RSS_test = sum(test_errors * test_errors)
print RSS_test
'''


Out[23]:
"\npredictions_test = model1.predict(test)\ntest_errors = predictions_test - test['relevance']\nRSS_test = sum(test_errors * test_errors)\nprint RSS_test\n"

In [24]:
predictions_test = model1.predict(test)
predictions_test


Out[24]:
dtype: float
Rows: 166693
[2.163608582030827, 2.1420705041883856, 2.3702256434364006, 2.375329695996665, 2.3321203991937733, 2.1420705041883856, 2.3519518753144855, 2.359421253304835, 2.178561192643488, 2.666910931658947, 2.507026409672497, 2.412125970203885, 2.5170007078538994, 2.3258349990880607, 2.2242032327035375, 2.3941878695971086, 2.1420705041883856, 2.6140855624161623, 2.1699813789661144, 2.2065949192062444, 2.717407736231517, 2.712865824093638, 2.1779403904932306, 2.337502788278624, 2.1420705041883856, 2.2617218748848984, 2.1420705041883856, 2.3199329871957146, 2.2344124134245718, 2.385991303459093, 2.5326287324153878, 2.4292165976367617, 2.2066803567199864, 2.335888999500826, 2.19339399150458, 2.1420705041883856, 2.2400855057732123, 2.167872304387674, 2.686237771842258, 2.5971778998122836, 2.1632514252912123, 2.3006226583130776, 2.2100150660265516, 2.1420705041883856, 2.4489243449769083, 2.3715994152817577, 2.423937656116298, 2.4740470887648307, 2.6614303865830693, 2.2622184754650583, 2.1495636190535095, 2.1420705041883856, 2.170407029327005, 2.1420705041883856, 2.1420705041883856, 2.1420705041883856, 2.1420705041883856, 2.1420705041883856, 2.14249141903142, 2.1420705041883856, 2.1420705041883856, 2.174982070139289, 2.1752787927457464, 2.1420705041883856, 2.2784681803740305, 2.482949286137385, 2.2329231441394812, 2.1420705041883856, 2.1420705041883856, 2.1420705041883856, 2.1420705041883856, 2.1420705041883856, 2.1420705041883856, 2.2786148712217598, 2.1420705041883856, 2.617211637992733, 2.4196822993012703, 2.1420705041883856, 2.1420705041883856, 2.1420705041883856, 2.1420705041883856, 2.470706963362782, 2.2446674927471233, 2.325098349640318, 2.2099398367301855, 2.212277355188578, 2.3397732932525357, 2.4953560472968697, 2.5118006031281985, 2.442928045771276, 2.536508440492266, 2.3429026352796365, 2.1420705041883856, 2.610918338740749, 2.3293265197071764, 2.267742429379897, 2.244849860123413, 2.422150849969186, 2.3325964347766477, 2.341568501633621, ... ]

In [25]:
# build the submission from the pair ids
submission = gl.SFrame(test['id'])

In [26]:
# add_column appends the predictions as X2; rename the generated X1/X2 columns
submission.add_column(predictions_test)
submission.rename({'X1': 'id', 'X2': 'relevance'})


Out[26]:
id relevance
1 2.16360858203
4 2.14207050419
5 2.37022564344
6 2.375329696
7 2.33212039919
8 2.14207050419
10 2.35195187531
11 2.3594212533
12 2.17856119264
13 2.66691093166
[166693 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

In [27]:
# clip predictions into the valid relevance range [1.0, 3.0]
submission['relevance'] = submission.apply(lambda x: 3.0 if x['relevance'] > 3.0 else x['relevance'])
submission['relevance'] = submission.apply(lambda x: 1.0 if x['relevance'] < 1.0 else x['relevance'])
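
The same clipping can be done in a single call with SArray.clip, assuming that method is available in this GraphLab version:

# equivalent one-liner: clip the predictions into [1.0, 3.0]
submission['relevance'] = submission['relevance'].clip(1.0, 3.0)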

In [28]:
# format relevance as a string for the CSV export
submission['relevance'] = submission.apply(lambda x: str(x['relevance']))

In [29]:
# quote_level = 3 corresponds to csv.QUOTE_NONE, i.e. no quoting in the output file
submission.export_csv('../data/submission.csv', quote_level = 3)

In [ ]:
#gl.canvas.set_target('ipynb')